Statistical thinking — generalized linear models
2025-10-31
A Gaussian linear model describes the expected value (mean) of the response as a linear combination of the predictor variables
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_j x_{ij} + \varepsilon_i\]
Implies that a constant change in a predictor leads to a constant change in the response
This is intuitive for a response that has a normal distribution: such a response can vary essentially without limit in either direction, or over a small range (e.g. heights)
This is inappropriate for a wide range of types of response variables common in animal science data sets
Animal science data are often discrete or categorical
Binary (dichotomous) variables
Count variables
Also continuous, positive (or non-negative) data, such as milk yield or animal weight
When we model we make assumptions regardless of the model we fit — the Gaussian linear model makes several assumptions
One assumption is that each value of the response \(y_i\) is a draw from a Gaussian random variable with mean \(\mu_i\) and standard deviation \(\sigma\)
We write this as
\[ y_i \sim \mathcal{N}(\mu_i, \; \sigma) \]
Read \(\sim\) as "is distributed as"; \(\mathcal{N}\) denotes the normal or Gaussian distribution
The mean of the Gaussian random variable is the expected value of \(y_i\) given a value of \(x_i\)
By expected value we mean the typical or most likely value
If we are being fancy, we write this expected value as \(\mathbb{E}(y_i)\)
And we state: \(\mathbb{E}(y_i) = \mu_i\)
The expected value of \(y_i\) is equal to \(\mu_i\), the mean of the Gaussian random variable
We model this expected value through the linear predictor
But instead of thinking of modelling \(y_i\) directly, as we have before
\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_j x_{ij} + \textcolor{red}{\varepsilon_i}\]
we can think of modelling the expected value of \(y_i\)
\[\mathbb{E}(y_i) = \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \cdots + \beta_j x_{ij}\]
Now the model errors (\(\textcolor{red}{\varepsilon_i}\)) no longer appear anywhere in the linear predictor
But where did the errors \(\textcolor{red}{\varepsilon_i}\) go?
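One way to see where they went: adding a Gaussian error to the mean is exactly the same as drawing from a Gaussian centred on that mean

\[ y_i = \mu_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma) \quad \Longleftrightarrow \quad y_i \sim \mathcal{N}(\mu_i, \sigma) \]

The errors are absorbed into the distributional statement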
To answer that, let’s draw some pictures of what we are doing
We have some data \(x_i\) that we use to model \(y_i\)
Figure 1
We fit the model \(\mathbb{E}(y_i) = \mu_i = \beta_0 + \beta_1 x_i\) by finding the line that minimises the sum of squared errors
Figure 2
But we assume that each data point \(y_i\) comes from a Gaussian distribution with mean \(\mu_i\) and standard deviation \(\sigma\)
Figure 3
Orange line = fitted relationship
Red dots = fitted value \(\hat{y}_i\)
Green dots = selected observed \(y_i\)
Black lines = fitted Gaussian distribution for each green dot
Note the width of each Gaussian is the same: we assume that the variance of the data \(y_i\) is the same for all observations
Note that the Gaussians extend into negative values of \(\mathbf{y}\)
A binary or dichotomous variable can take only one of two values
Numerically these are coded as 0 and 1 when we fit a logistic regression
1 is generally known as the success state; more usefully, it indicates cases where the risk factor or outcome of interest is present
0 is the failure state; more usefully, it indicates the absence of the risk factor or outcome of interest
Alternatively, we may have data representing a proportion of successes from some total in a set of \(m\) trials or experiments
Such variables are strictly bounded at 0 and 1
Association between high somatic cell count (SCC) & milk yield in dairy cows
Define high SCC as SCC > 125,000 cells ml\(^{-1}\)
Linear regression line could predict negative probability of high SCC
Variance unlikely to be constant; reduced at 0 & 1
Would like to fit an s-shaped curve to the data but we want a linear equation
Can achieve both aims via a transformation of the s-shaped response to a straight line: logit transformation
The logistic regression is given by the following equation
\[\log \left ( \frac{p}{1 - p} \right ) = \beta_0 + \beta_{1} x_1\]
\(\displaystyle \frac{p}{1 - p}\) is the odds, a measure that compares the probability of an event happening with the probability that it does not happen
If \(p\) = 0.25 the odds are \(\frac{0.25}{1 - 0.25} = \frac{1}{3}\) or 1 success for every 3 failures
If \(p\) = 0.8 the odds are \(\frac{0.8}{1 - 0.8} = 4\) or 4 successes for each failure
The logit is the natural log of the odds; odds are a non-negative number, but log odds can be any real number
These are all transformations that we can undo / reverse!
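As a quick illustration (a minimal sketch in Python; the analyses in these notes use R), the odds, the logit, and its inverse round-trip exactly:

```python
import math

def logit(p):
    # log-odds: maps a probability in (0, 1) to any real number
    return math.log(p / (1 - p))

def expit(x):
    # inverse logit: maps any real number back to a probability in (0, 1)
    return 1 / (1 + math.exp(-x))

# p = 0.25: odds of 1/3, i.e. 1 success for every 3 failures
print(round(0.25 / (1 - 0.25), 3))   # 0.333
print(round(logit(0.25), 3))         # -1.099 (log odds can be negative)
# the transformation is reversible
print(round(expit(logit(0.25)), 3))  # 0.25
```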
\[\log \left ( \frac{p}{1 - p} \right ) = \beta_0 + \beta_{1} x_1\]
\(\beta_0\) is a constant term; the predicted log-odds of success (high SCC) where \(x\) is 0 (0 milk yield)
\(\beta_1\) is the slope; the predicted change in the log-odds of success (high SCC) for a 1 unit increase in \(x\) (an additional kg of milk per day)
Because log-odds can take any value, the right-hand side is an ordinary, unconstrained linear predictor
A method known as maximum likelihood (ML) is used to estimate values of \(\beta_0\) and \(\beta_j\) from the data, the maximum likelihood estimates (MLEs)
So called because these estimates are the ones that make the data most likely under the model
In linear regression we minimised the residual sums of squares
With GLMs we minimise the deviance — think of this as the lack of fit of the model
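More formally, the deviance compares our model's log-likelihood with that of a saturated model (one that fits the data perfectly)

\[ D = 2 \left( \ell_{\text{saturated}} - \ell_{\text{model}} \right) \]

A deviance of 0 means a perfect fit; larger values mean a worse fit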
Call:
glm(formula = high_scc ~ milk_yield_kg, family = binomial, data = scc)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.819109 0.195370 4.193 2.76e-05 ***
milk_yield_kg -0.035740 0.009306 -3.841 0.000123 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1577.2 on 1139 degrees of freedom
Residual deviance: 1562.2 on 1138 degrees of freedom
AIC: 1566.2
Number of Fisher Scoring iterations: 4
The output looks much the same as for a linear model, so do not sweat the detail if it is unclear 😰
Can compare likelihoods of two models using a likelihood ratio test
Analysis of Deviance Table
Model 1: high_scc ~ 1
Model 2: high_scc ~ milk_yield_kg
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1139 1577.2
2 1138 1562.2 1 15.056 0.0001044 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that compared to the null model of no change in risk of high SCC, the milk yield of a cow has a statistically interesting effect on the probability of high SCC
We would reject H0: the data we observed is very unlikely to have arisen if H0 were true (about 1 in 10000 chance)
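The likelihood ratio test statistic is the drop in deviance between the two models, compared to a \(\chi^2\) distribution with degrees of freedom equal to the number of extra parameters (here 1)

\[ D_{\text{null}} - D_{\text{model}} = 1577.2 - 1562.2 \approx 15.06 \]

which gives the \(p\) value of 0.0001044 shown in the table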
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.819109 0.195370 4.193 2.76e-05 ***
milk_yield_kg -0.035740 0.009306 -3.841 0.000123 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Intercept) milk_yield_kg
2.268 0.965
Estimate contains the estimates for \(\beta_j\) in log odds; easier to interpret on odds scale
Tests of \(H_0: \; \beta_j = 0\) are done using a Wald statistic instead of the \(t\) test
\[z = \frac{\beta_j - 0}{\mathsf{SE}(\beta_j)}\]
Rather than following a \(t\) distribution, Wald statistics are approximately normally distributed; probability of seeing a \(z\) as extreme as observed under a standard normal distribution
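For example, for milk_yield_kg the summary gives

\[ z = \frac{-0.035740}{0.009306} \approx -3.841 \]

matching the z value column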
(Intercept) milk_yield_kg
2.268 0.965
The odds of high SCC for a non-yielding cow are 2.268
\(\beta_1\) is negative, so higher-yielding cows are associated with decreased risk of high SCC; for each additional kg of yield, the odds of high SCC are multiplied by 0.965
Why times? additive effects on the log scale are multiplicative on the non-log scale
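To see why, note that a 1 unit increase in \(x\) adds \(\beta_1\) to the log-odds, so on the odds scale

\[ \frac{\text{odds}(x + 1)}{\text{odds}(x)} = e^{\beta_1} = \exp(-0.035740) \approx 0.965 \]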
Odds ratio of high SCC comparing milk_yield_kg of 25 kg vs 20 kg
(Note: \(\beta_1\) = -0.036)
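By hand, a 5 kg difference adds \(5 \beta_1\) to the log-odds, giving an odds ratio of

\[ \exp(5 \times -0.035740) = \exp(-0.1787) \approx 0.836 \]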
Don’t do it by hand, use comparisons()
Estimate Pr(>|z|) S 2.5 % 97.5 %
0.836 <0.001 13.0 0.763 0.916
Term: milk_yield_kg
Type: response
Comparison: ln(odds(25) / odds(20))
transform = "exp" because we want the inverse of log(); "lnoravg" requests the log-odds ratio as the comparison
If those sound complicated, welcome to the club because they are!
It is much more useful to describe the model in terms of the predicted probabilities or as risk ratios
A risk ratio or relative risk is the ratio of the risk under two conditions
\[ \text{risk} = \frac{\text{number of cases}}{\text{number at risk}} \]
\[ \text{risk ratio} = \frac{\text{risk}_a}{\text{risk}_b} \]
Compute the risks: Estimate is the probability of high SCC for the two milk yields (20 & 25)
milk_yield_kg Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
20 0.526 0.0149 35.3 <0.001 906.2 0.497 0.555
25 0.481 0.0189 25.5 <0.001 472.4 0.444 0.518
Type: response
We want the ratio of these two risks (0.526 & 0.481):
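Dividing the estimated risk at 25 kg by the risk at 20 kg

\[ \text{risk ratio} = \frac{0.481}{0.526} \approx 0.914 \]

so the risk of high SCC at 25 kg of milk yield is about 91% of the risk at 20 kg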
If we want the predicted probability of high SCC for our two values of milk yield, these are the model predictions
Note: the response in a logistic regression model is the probability of “success”, where in this case “success” is “high SCC”
We just saw these probabilities, but it is worth repeating them
milk_yield_kg Estimate Pr(>|z|) S 2.5 % 97.5 %
20 0.526 0.0806 3.6 0.497 0.555
25 0.481 0.3262 1.6 0.445 0.519
Type: invlink(link)
We state which values of milk_yield_kg we want predictions for using datagrid()
We can do this for more than just a couple of values of milk_yield_kg
plot_predictions() will draw graphs of the model predictions conditioned on milk_yield_kg
Here, we extrapolate well beyond range of the data
Approximation used to create the confidence band starts to fail
Generalised Linear Models (GLMs) are a powerful extension of the linear regression model, broadening the types of data & conditional distributions that can be modelled beyond the normal or Gaussian distribution of linear regression
Binary (dichotomous) variables can be modelled using logistic regression
Count data can be modelled using Poisson regression
Milk yield, animal weight, or other positive, continuous variables can be modelled using a gamma regression
All are special cases of the broad class of GLMs
Also includes linear regression as a special case
Logistic regression is a special case of the GLM where the conditional distribution of the response is assumed to be binomial
GLMs allow the conditional distribution of the response to be any distribution from the exponential family; Poisson, binomial, Gaussian, gamma, multinomial, …
There are three parts to a GLM
Whilst this affords a wealth of options, often natural choices for the conditional distribution of \(y\) and the link function arise from the type of data being modelled
In a GLM we want a model for the expectation of \(y\), \(\mathbb{E}(y_i)\), which here will usually be the mean of the response, \(\mu_i\)
We might model \(y_i\) as following a Poisson distribution if the data were counts, or a binomial distribution in the case of 0/1 data, as we did in the high SCC example
We need to decide which predictor variables and any transformations of them should be used to predict \(y\); this is the linear predictor, \(\eta\)
\[\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_j x_{ij}\]
However, as we saw in the Gaussian linear model, there is nothing stopping the above equation returning values that don’t make sense for the variable we are modelling
So we need to map the values from the response scale on to the linear scale, just as we mapped from probabilities to log-odds using the logit transformation. This is the link function, \(g()\)
The link function maps from the response scale to the linear predictor scale
\[g(\mu_i) = \eta_i\]
To map from the linear predictor scale back to the response scale we need to undo \(g()\), which means we need the inverse of \(g\), \(g^{-1}\):
\[\mu_i = g^{-1}(\eta_i)\]
The inverse of log is exp, of logit is expit, etc. but R handles this for you
Sounds complicated but look again at the logistic regression
\[\begin{align*} \log \left ( \frac{\mu_i}{1 - \mu_i} \right ) & = \eta_i \\ \mathrm{logit}(\mu_i) & = \eta_i \\ \mu_i & = \mathrm{logit}^{-1}(\eta_i) \end{align*}\]
The logit transformation linearises the relationship & its inverse maps back from the linear scale to the non-linear response scale
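To make this concrete, here is a minimal Python sketch (the analyses in these notes use R) applying the inverse logit to the linear predictor of the high SCC model fitted earlier:

```python
import math

# coefficient estimates copied from the logistic regression summary above
b0, b1 = 0.819109, -0.035740

def expit(eta):
    # inverse of the logit link: linear predictor -> probability
    return 1 / (1 + math.exp(-eta))

for milk_yield in (20, 25):
    eta = b0 + b1 * milk_yield               # log-odds of high SCC
    print(milk_yield, round(expit(eta), 3))  # 20 -> 0.526, 25 -> 0.481
```

These match the predicted probabilities from the model output earlier.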
Other link functions are suitable for other types of GLMs; in a Poisson GLM, as we’ll see next, it is common to use the log link function
Imagine we observed how the counts of some thing (event, cases of disease etc) varied with a covariate \(x\)
What if we fit a Gaussian linear model to these data?
The model diagnostics are not great either
We fit a Poisson GLM to these data
Binomial GLM (0/1 data) — logistic regression
Binomial GLM (counts from a total) — Number of still born piglets in litters
Gamma GLM — milk yield, weight, etc.
What would diagnostics for a Gaussian model fitted to those data look like?
Inverse Gaussian GLM — milk yield, weight, etc.: like the gamma, just more extreme
Often, we’ll encounter count data of some sort; number of cases of a disease in an epidemic
Counts are strictly non-negative; can’t have a count of less than 0
Variance increases as a function of the mean (\(\lambda\))
The Poisson distribution is a standard distribution for modelling count data
The Poisson gives the distribution of the number of things (individuals, cases, events) in a given sampling interval/effort if each event is independent
The Poisson is a simple distribution; defined by a single parameter \(\lambda\), which is the mean and the variance
The standard link function for a Poisson GLM is the log link; maps non-negative counts on to a linear scale
\[\begin{align*} y_i & \sim \text{Poisson}(\mu_i) \\ \lambda_i & = \mathbb{E}(y_i) = \mu_i \\ \log(\mu_i) & = \beta_0 + \beta_1 x_{i1} + \cdots \end{align*}\]
Association between endometritis and breed of cattle in Danish dairy herds
# A tibble: 4 × 3
breed endometritis count
<fct> <fct> <dbl>
1 Holstein Yes 100
2 Holstein No 400
3 Jersey Yes 10
4 Jersey No 190
Data like this are common in biology — contingency tables
| | Yes | No | Total |
|---|---|---|---|
| Holstein | 100 | 400 | 500 |
| Jersey | 10 | 190 | 200 |
| Total | 110 | 590 | 700 |
A key question of interest is independence of the effects of variables on the cell frequencies
Analysis of deviance table similar to sequential SS from linear regression
Analysis of Deviance Table
Model: poisson, link: log
Response: count
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 3 523.43
breed 1 132.83 2 390.60 < 2.2e-16 ***
endometritis 1 361.54 1 29.05 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Add an interaction between breed and endometritis to the model & refit
Call:
glm(formula = count ~ breed + endometritis + breed:endometritis,
family = poisson(link = "log"), data = endo)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.99146 0.05000 119.829 < 2e-16 ***
breedJersey -0.74444 0.08811 -8.449 < 2e-16 ***
endometritisYes -1.38629 0.11180 -12.399 < 2e-16 ***
breedJersey:endometritisYes -1.55814 0.34317 -4.540 5.61e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 5.2343e+02 on 3 degrees of freedom
Residual deviance: -3.9968e-14 on 0 degrees of freedom
AIC: 33.517
Number of Fisher Scoring iterations: 3
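As a check: the residual deviance is (numerically) zero, so the saturated model reproduces the cell counts exactly, and the interaction coefficient is the log odds ratio computed directly from the table

\[ \log \left( \frac{10 \times 400}{190 \times 100} \right) = \log \left( \frac{4}{19} \right) \approx -1.558 \]

matching the breedJersey:endometritisYes estimate above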
Compare models using likelihood ratio test
Analysis of Deviance Table
Model 1: count ~ breed + endometritis
Model 2: count ~ breed + endometritis + breed:endometritis
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 1 29.054
2 0 0.000 1 29.054 7.04e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare models using AIC
Mayekar & Kodandaramaiah (2017; PLoS One, 12, e0171482) studied determinants of pupal colour (green vs. brown) of a tropical butterfly
Response
Predictors
Discuss in your groups:
Moorehouse et al (2016; PLoS One, 11, e0165425) used genetic analyses to determine parentage of bear cubs and cross classified cubs and mothers and whether they caused problems around humans
Table shows the number of cubs and mothers classified by whether they are problematic or not
| Offspring | Mother: Yes | Mother: No | Total |
|---|---|---|---|
| Yes | 5 | 18 | 23 |
| No | 3 | 50 | 53 |
| Total | 8 | 68 | 76 |
Discuss in your groups:
Teng et al (2020; PLoS One, 15, e0234190) surveyed cat owners in Australia on factors that might affect the prevalence of overweight and obese cats
Body Condition Scores
Begging behaviour
| Begging? | BCS 1 & 2 | BCS 3 | BCS 4 & 5 |
|---|---|---|---|
| Never | 21 | 266 | 56 |
| Sometimes | 55 | 527 | 162 |
| Often | 14 | 106 | 69 |
| Always | 10 | 47 | 44 |
Discuss in your groups: